
fix: Add support for unsigned Arrow datatypes in schema conversion #1617


Open
wants to merge 8 commits into main

Conversation

@gkpanda4 commented Aug 18, 2025

Which issue does this PR close?

What changes are included in this PR?

Bug Fixes

  • Fixed crash when ArrowSchemaConverter encounters unsigned datatypes
  • Resolved "Unsupported Arrow data type" errors for UInt8/16/32/64

Features

  • Added casting support for unsigned Arrow types
  • UInt8/16 → Int32 (safe casting to larger signed type)
  • UInt32 → Int64 (safe casting to larger signed type)
  • UInt64 → Error (no safe casting option, explicit error with guidance)

Code Changes

  • Enhanced the ArrowSchemaConverter primitive() method with unsigned type handling (see the sketch after this list)
  • Added comprehensive test: test_unsigned_type_casting() for all unsigned variants
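
For illustration, the conversion described above reduces to a few match arms in the converter's primitive() visitor. The sketch below is a standalone rendering of that logic rather than the exact diff; the error kind and message wording are assumptions:

use arrow_schema::DataType;
use iceberg::spec::{PrimitiveType, Type};
use iceberg::{Error, ErrorKind, Result};

// Standalone sketch of the unsigned-type handling; in the PR this lives in
// ArrowSchemaConverter's primitive() method rather than a free function.
fn convert_unsigned(dt: &DataType) -> Result<Type> {
    match dt {
        // UInt8/UInt16 fit losslessly in Iceberg's 32-bit int.
        DataType::UInt8 | DataType::UInt16 => Ok(Type::Primitive(PrimitiveType::Int)),
        // UInt32 fits losslessly in Iceberg's 64-bit long.
        DataType::UInt32 => Ok(Type::Primitive(PrimitiveType::Long)),
        // UInt64 has no lossless signed counterpart, so it is rejected explicitly.
        DataType::UInt64 => Err(Error::new(
            ErrorKind::DataInvalid,
            "Unsupported Arrow data type: UInt64. Cast to Int64 (with range validation) or to a Decimal instead.",
        )),
        other => Err(Error::new(
            ErrorKind::DataInvalid,
            format!("Not an unsigned integer type: {other:?}"),
        )),
    }
}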

Files Modified

  • crates/iceberg/src/arrow/schema.rs

Impact

✅ No breaking changes - existing functionality preserved
✅ Safe type casting prevents overflow issues
✅ Clear error messages for unsupported UInt64 with alternatives
✅ Follows proven PyIceberg implementation approach

Are these changes tested?

  • All existing schema tests pass
  • New comprehensive test covers UInt8, UInt16, UInt32, UInt64 conversion behavior
  • Test verifies proper casting: UInt8/16→Int32, UInt32→Int64, UInt64→Error

@emkornfield commented:

Sorry, I'm new to reviewing (and mostly new to the code base), so take these comments with a grain of salt, but this approach seems brittle:

  1. What happens if someone updates the doc field, removing the type information, and Arrow RS tries to read the data back?
  2. What happens if a non-Arrow-RS reader tries to read data written from Arrow with these fields (in particular int32 and int64)?

It seems a more robust solution would be to:

  1. Convert uint32->int64
  2. Either still block uint64, convert uint64 to a Decimal with an appropriate precision to represent the full range (see the note below), or use int64 and validate that no values written are outside the appropriate range.
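
A note on the precision needed for that Decimal option: u64::MAX is 18446744073709551615, i.e. 20 decimal digits, so a decimal(20, 0) would cover the full uint64 range. A trivial check (not from the PR):

fn main() {
    // u64::MAX has 20 decimal digits, so decimal(20, 0) can hold any uint64 value.
    assert_eq!(u64::MAX.to_string().len(), 20);
}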

@CTTY (Contributor) commented Aug 20, 2025

I have the same concern as @emkornfield: using doc to determine the field type seems unsafe to me. I think casting the type should be fine. There would be type loss when converting the Iceberg schema back to an Arrow schema, but that should be ok.

Also, the Python implementation can serve as a good reference. Note that PyIceberg uses bit width, while arrow-rs only provides primitive_width(), which returns the width in bytes.
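
To make that difference concrete (assuming arrow_schema::DataType; the exact branching in the PR may differ):

use arrow_schema::DataType;

fn main() {
    // primitive_width() reports the width in bytes; PyIceberg branches on bits,
    // so the value has to be multiplied by 8 before comparing.
    let bit_width = DataType::UInt32.primitive_width().map(|bytes| bytes * 8);
    assert_eq!(bit_width, Some(32)); // 4 bytes * 8 = 32 bits
}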

@gkpanda4 (Author) commented:

@emkornfield Right, the current approach has the potential for silent data corruption because of its dependency on Arrow's doc field. @CTTY Thanks for the references; I will use them for the casting.

My updated approach uses safe bit-width casting for unsigned integer types, following the proven iceberg-python implementation:

uint8/uint16 → int32: Safe upcast with no overflow risk
uint32 → int64: Safe upcast preserving full uint32 range
uint64 → explicit block: Rather than risk data loss through an unsafe conversion, provide clear error guidance directing users to choose between int64 (with range validation) and decimal (with full precision) based on their specific requirements

Let me know if there are any concerns; otherwise I will have the changes out.

// Cast unsigned types based on bit width (following Python implementation)
DataType::UInt8 | DataType::UInt16 | DataType::UInt32 => {
// Cast to next larger signed type to prevent overflow
let bit_width = p.primitive_width().unwrap_or(0) * 8; // Convert bytes to bits


This seems superfluous; can't you just match on the data types and map them directly?

DataType::UInt8 | DataType::UInt16 => Ok(Type::Primitive(PrimitiveType::Int)),
DataType::UInt32 => Ok(Type::Primitive(PrimitiveType::Long)),

@gkpanda4 (Author):

Yes, will make this logic simple

@@ -378,7 +378,24 @@ impl ArrowSchemaVisitor for ArrowSchemaConverter {
DataType::Int8 | DataType::Int16 | DataType::Int32 => {
Ok(Type::Primitive(PrimitiveType::Int))
}
// Cast unsigned types based on bit width (following Python implementation)


Suggested change:
- // Cast unsigned types based on bit width (following Python implementation)
+ // Cast unsigned types based on bit width to allow for no data loss

I'm not sure Python compatibility is a direct goal here?

@gkpanda4 (Author):

Will incorporate this


// Test UInt8/UInt16 → Int32 casting
{
let arrow_field = Field::new("test", DataType::UInt8, false).with_metadata(


nit: it doesn't look like this is testing UInt16?

@gkpanda4 (Author):

I have added this scenario

@@ -1717,6 +1734,60 @@ mod tests {
}
}

#[test]
fn test_unsigned_type_casting() {
// Test UInt32 → Int64 casting


Is it possible to parameterize the non-error cases, at least with expected input/output, to avoid boilerplate?

@gkpanda4 (Author):

Done
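
For reference, a table-driven shape for the non-error cases could look roughly like this. This is a sketch rather than the committed test; the "PARQUET:field_id" metadata key and the field_by_id lookup are assumptions about the surrounding test module:

use std::collections::HashMap;

use arrow_schema::{DataType, Field, Schema as ArrowSchema};
use iceberg::arrow::arrow_schema_to_schema;
use iceberg::spec::{PrimitiveType, Type};

#[test]
fn test_unsigned_type_casting_table_driven() {
    // (input Arrow type, expected Iceberg primitive type)
    let cases = [
        (DataType::UInt8, PrimitiveType::Int),
        (DataType::UInt16, PrimitiveType::Int),
        (DataType::UInt32, PrimitiveType::Long),
    ];
    for (arrow_type, expected) in cases {
        // Field IDs are carried in Arrow field metadata during conversion.
        let field = Field::new("test", arrow_type, false).with_metadata(HashMap::from([(
            "PARQUET:field_id".to_string(),
            "1".to_string(),
        )]));
        let schema = arrow_schema_to_schema(&ArrowSchema::new(vec![field])).unwrap();
        let converted = schema.field_by_id(1).unwrap().field_type.clone();
        assert_eq!(*converted, Type::Primitive(expected));
    }
}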

@@ -1717,6 +1734,60 @@ mod tests {
}
}

#[test]


This probably isn't the right module, but it would be nice to have a test that actually exercises writing these types and then reading them back again?

@gkpanda4 (Author):

I implemented an integration test for the unsigned-type roundtrip, but discovered that ParquetWriter would also require modification to handle unsigned data. The issue stems from a type mismatch between schema and data.

The problem is that schema conversion (arrow_schema_to_schema) transforms the metadata but leaves the actual data unchanged, so when writing, Arrow validation fails because of this mismatch.
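
Not part of this PR, but to make the mismatch concrete: on the write path the batch columns would have to be cast to the signed types the converted schema expects before they reach the Arrow writer. A minimal sketch, assuming arrow_array and arrow_cast:

use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch, UInt32Array};
use arrow_cast::cast;
use arrow_schema::{ArrowError, DataType, Field, Schema as ArrowSchema};

fn cast_example() -> Result<RecordBatch, ArrowError> {
    // The data arrives as UInt32 ...
    let unsigned: ArrayRef = Arc::new(UInt32Array::from(vec![1, 2, u32::MAX]));
    // ... but the schema derived from the Iceberg table expects Int64,
    // so writing the unsigned column as-is fails Arrow's validation.
    let target = Arc::new(ArrowSchema::new(vec![Field::new("v", DataType::Int64, false)]));
    // Casting the column up front keeps the batch consistent with that schema.
    let signed = cast(&unsigned, &DataType::Int64)?;
    RecordBatch::try_new(target, vec![signed])
}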

@CTTY (Contributor) commented Aug 22, 2025:

I think writing record batches that contain unsigned types is out of the scope of the original issue, and it can be tricky:

  • ParquetWriter uses AsyncArrowWriter under the hood
  • AsyncArrowWriter uses an Arrow schema that was converted from the Iceberg table schema
  • When converting an Iceberg schema to an Arrow schema, the Arrow schema won't have any unsigned types (and I don't think it makes sense to add them unless there is a valid use case)
  • Because of the schema mismatch between the record batches and the Arrow schema, the Arrow writer will fail

Reply:

Thanks, from the original issue it seems the scope is ambiguous. It seems like this change makes it possible to create a schema from Arrow with unsigned types, which might be helpful by itself, but I imagine the next thing the user would want to do is actually write the data?

It seems fine to check this in separately as long as there is a clean failure for the unsigned types (i.e. we don't silently lose data).

@emkornfield left a comment:

A few more comments. The additional test coverage would be my primary concern; the rest are mostly style nits.

@gkpanda4 requested review from emkornfield and CTTY on August 22, 2025 at 20:37
@emkornfield left a comment:

Seems like we should figure out if this actually closes out the original issue or if we should keep it open for write support, but with my limited knowledge these changes seem reasonable.


Successfully merging this pull request may close these issues.

bug: ArrowSchemaConverter can't handle unsigned datatypes from arrow
3 participants